KEP-3695: extend pod resource API for DRA #3738
Conversation
Hi @moshe010. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/assign
/cc
This PR seems to be missing the required file. Just to make sure, did you also follow the latest template for the KEP itself?
I'm generally fine with the content of the KEP here; I would just like to have it documented better why DRA can't provide data useful to GetAllocatableResources, and whether this status could change in the future. A few lines are sufficient.
To enhance the pod resource API GetAllocatableResources for resources that are managed by DRA. With DRA there is no standard way to get resource capacity.
non-blocking: I understand we can't provide something that DRA itself is not providing, but making the API more irregular makes me pause. Can we try to list options here to sketch some plans for the future? It would also be nice to spend a few lines explaining why DRA can't provide this information.
Maybe this issue of a common way to get resource capacity with DRA should be tracked as a separate issue?
@klueska maybe we can standardize the nodeallocationstates CRD https://gitlab.com/nvidia/cloud-native/k8s-dra-driver/-/blob/main/deployments/static/crds/gpu.resource.nvidia.com_nodeallocationstates.yaml as the CRD for the DRA kubelet plugin?
Not sure how we would (or even could) provide Capacity or Allocatable for DRA managed resources.
To answer your question directly, I don't think it's reasonable to assume there will be a standard representation for each resource managed by a DRA resource driver. It's not even required that a CRD be used for this purpose, let alone adhere to a specific format.
Besides that, the notion of a "count" for a DRA managed resource is itself not well defined, since each resource that gets allocated can take a drastically different form depending on what ClaimParameter objects are used to request it.
Made it about a third of the way through and need to take a break. Will come back to it soon.
I wasn't aware of it. Who should I put as the approver? Should I just pick one from the list?
Yes, please pick one. The PRR reviewers will rebalance the load internally and suggest a replacement should the one you picked turn out to be overloaded.
Our proposal is to extend the existing PodResources gRPC service of the Kubelet with a repeated DynamicResource field in the ContainerResources message. This new field will contain information about the DRA resource class, the DRA resource claim and a list of CDI Devices allocated by a DRA driver.
Additionally, we propose adding a Get method to the existing gRPC service to allow querying specific pods for their allocated resources.
The extended interface is shown in proto below:
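The proto definition itself is not reproduced in this excerpt. As a rough, non-authoritative sketch of what the extension described above could look like, here is a Go-side view; all type, field, and method names are assumptions drawn from the description, not the actual generated kubelet API:

```go
// Package sketch: a hypothetical Go-side view of the proposed
// PodResources extension. All names are assumptions based on the KEP
// text above, not the authoritative kubelet API.
package sketch

// CDIDevice identifies a single CDI device by its fully qualified name,
// e.g. "vendor.example.com/gpu=gpu0".
type CDIDevice struct {
	Name string
}

// DynamicResource describes one DRA allocation attached to a container:
// the resource class, the resource claim, and the CDI devices the DRA
// driver prepared for that claim.
type DynamicResource struct {
	ClassName      string
	ClaimName      string
	ClaimNamespace string
	CDIDevices     []CDIDevice
}

// ContainerResources would gain a new repeated DynamicResources field
// next to the existing device-plugin and CPU fields (omitted here).
type ContainerResources struct {
	Name             string
	DynamicResources []DynamicResource
}

// PodResources groups the per-container information for one pod.
type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// PodResourcesLister sketches the service surface: List and
// GetAllocatableResources already exist today (return types simplified
// here); Get is the proposed addition for querying a single pod.
type PodResourcesLister interface {
	List() ([]PodResources, error)
	GetAllocatableResources() ([]ContainerResources, error)
	Get(podName, podNamespace string) (PodResources, error)
}
```

The existing List and GetAllocatableResources calls are shown only to situate the proposed Get method; their real signatures differ.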
I understand the proposed API is read-only and mainly for monitoring, but I'm not sure about the entire flow. May I have a couple of questions answered before approving this KEP:
- We are using polling here. But what if the DRA driver is dead, does the pod using the resource continue running? How do we make sure the stats are up to date?
- DRA might be deallocated on the node; how often does the kubelet issue Get() calls to reconcile the status?
- Do we need registration logic with the kubelet?
I guess some of the questions might be discussed in a separate KEP, but I hope we have answers to these high-level questions. Thanks!
For (1):
I guess you could think of it as polling, but the PodResources API is really more of an on-demand API that goes and gets the information it needs at the time it is requested. Nothing is cached as part of the API itself. The various components that hold the information (CPUManager, DeviceManager, etc) may cache this information, but the API doesn't "store" this information anywhere by itself.
In the case of the DRA manager, the information required to populate the new fields in the PodResources API is all cached in an in-memory store for each container that gets launched. However, unlike the CPUManager and DeviceManager, this information is not persisted / checkpointed by the DRA manager itself (meaning the cache cannot be directly repopulated across a kubelet restart). It is idempotently queryable from each DRA driver, though, meaning that the in-memory cache can be used to populate the API if the information is available; otherwise the cache can be filled by querying the DRA driver and then pushed up to the API as needed.
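To make that behaviour concrete, here is a minimal sketch of the "serve from the in-memory cache, otherwise re-query the driver" pattern; ClaimInfo, Manager, and DriverClient are hypothetical stand-ins, not the real kubelet DRA manager types:

```go
// Hypothetical sketch of the lookup pattern described above. The types
// and the driver call are illustrative stand-ins, not the real kubelet
// DRA manager implementation.
package sketch

import "fmt"

// ClaimInfo holds what the PodResources API needs for one resource claim.
type ClaimInfo struct {
	ClassName  string
	ClaimName  string
	CDIDevices []string
}

// DriverClient abstracts the idempotent query against a DRA driver
// (conceptually the NodePrepareResources call mentioned below).
type DriverClient interface {
	QueryClaim(claimUID string) (ClaimInfo, error)
}

// Manager keeps an in-memory cache that is not checkpointed, so it is
// empty after a kubelet restart until it is refilled from the drivers.
type Manager struct {
	cache   map[string]ClaimInfo
	drivers map[string]DriverClient
}

// ClaimInfoFor serves from the cache when possible and otherwise refills
// it by asking the owning driver, which is safe because the driver query
// is idempotent.
func (m *Manager) ClaimInfoFor(driverName, claimUID string) (ClaimInfo, error) {
	if info, ok := m.cache[claimUID]; ok {
		return info, nil
	}
	driver, ok := m.drivers[driverName]
	if !ok {
		return ClaimInfo{}, fmt.Errorf("unknown DRA driver %q", driverName)
	}
	info, err := driver.QueryClaim(claimUID)
	if err != nil {
		return ClaimInfo{}, err
	}
	m.cache[claimUID] = info
	return info, nil
}
```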
In response to this specific question: "But what if the DRA driver is dead, does the pod using the resource continue running?"
The lifetime of a pod is not (necessarily) tied to the "readiness" of the DRA drivers that manage its resource claims. The only time a pod is blocked from deletion by a claim not being deleted (possibly because the DRA driver managing it is down) is if it is the last pod referencing that claim.
So I think what you are asking is "If a pod is running, but I can't communicate with one of the DRA drivers to query its list of DynamicResources (and this information is not available in the DRA manager's in-memory cache), what do I do? Do I omit this information? Do I return an error? etc."
Does this accurately reflect what you are trying to ask?
For (2):
What do you mean by "DRA might be deallocated on the node"? Do you mean the resources associated with a specific DRA driver might be deallocated on a node? And the question is how often the kubelet will need to consult with the DRA driver to refresh whatever information it may have about these resources? I think I answered this in my response to (1) when talking about the in-memory cache of the DRA manager and it ability to idempotently retrieve any missing information from the DRA driver.
The Get() call you refer to is something exposed by the kubelet (not something the kubelet calls itself), so I assume you actually meant NodePrepareResources() (which is the idempotent call a kubelet can make against a DRA driver to get the list of CDI devices associated with a claim).
For (3):
What are you proposing should have registration logic with the kubelet? DRA resource drivers already do register with the kubelet (so it can make calls against them to Prepare/Unprepare resources). I think I need more context here to answer the question properly.
@dchen1107 I have updated the Proposed API section with more details that hopefully address your comments / concerns. PTAL
@klueska Thanks for the detailed response.
You answered my questions. I was trying to figure out how idempotent the behavior is since DRA is dynamic, and how persistent and consistent the status is across kubelet and/or DRA driver restarts. Even if we might still have some inconsistency across a restart, it is expected to some extent. I am accepting it for now.
Also, the KEP heavily depends on DRA, but it doesn't include a high-level component diagram capturing the dependencies and interfaces; it assumes the reviewers have all the information. My original wish was that, in answering my questions, we could also address the issue for future readers. I guess I didn't state it clearly. I am ok with it for now.
No problem. They were important questions to ask.
Regarding:
Even if we might still have some inconsistency across a restart, it is expected to some extent.
I don't think we will have any inconsistencies. The PodResources API is provided by the kubelet itself, so if it is down the API is inaccessible anyway. And once it comes back up, it will reconcile all of the info it needs before it starts serving the API again.
Will there be any race if the API is called too early, before everything got registered?
As for plugins not responding - there will be no partial results, correct? Either all known information is returned or an error. So one plugin can be a "noisy" neighbor for the API. Should this be added as a risk?
Will there be any race if the API is called too early, before everything got registered?
There shouldn't be. The set of CDI devices to allocate to a container is pulled from the various DRA plugins before a container is ever created. Either all plugins successfully allocate devices (thus populating the cache) and the container eventually gets created and starts running, or some plugin fails and the container start fails as well, meaning only containers whose cached CDI device info is complete will ever be created and thus be accessible through the API.
As for plugins not responding - there will be no partial results, correct? Either all known information is returned or an error. So one plugin can be a "noisy" neighbor for the API. Should this be added as a risk?
That shouldn't matter. If a plugin does not respond, then the container will never be started and thus will not be something queryable through the PodResources API.
I have a couple of PRR questions. Since this proposal depends heavily on DRA, I guess we should carefully write down the differences/dependencies here, which is what the questions are all about.
/cc @johnbelamaric
###### Does this feature depend on any specific services running in the cluster?

No.
Does it require DRA to be working?
It requires DRA to be available, but DRA is not a service (it is something built into the kubelet / default scheduler). Simply ensuring that the DRA feature gate is enabled is enough to ensure that DRA is available, so I tend to agree that this does not depend on any services running in the cluster (as stated currently).
I updated this section to be the same as in the DRA KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md#dependencies
/lgtm
I'd like to see a few minor updates to the PRR answers, but overall the PRR looks good.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dchen1107, johnbelamaric, klueska, moshe010.
owning-sig: sig-node
participating-sigs: []
status: provisional
creation-date: implementable
This field should be the date created. Right now it gives the misleading impression that the KEP is approved as "implementable", when the status is "provisional", and it is missing the date.
See e.g.: creation-date: 2023-01-05
It looks like something got screwed up here.
The date should definitely be fixed, but the status should, in fact, be implementable, not provisional (the code that implements this KEP was merged in kubernetes/kubernetes#115847).
### Goals

- To allow node monitoring agents to know the allocated DRA resources for Pods on a node.
- To allow the DRA feature to work with CNIs that require complex network devices such as RDMA. DRA resource drivers will allocate the resources, and the meta-plugin will read the allocated [CDI Devices](https://github.com/container-orchestrated-devices/container-device-interface) using the PodResources API. The meta-plugin will then inject the device-id of these CDI Devices as CNI arguments and invoke other CNIs (just as it does for devices allocated by the device plugin today).
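For illustration, a CNI meta-plugin or monitoring agent on the node might traverse the response of the proposed Get call roughly like this; the client interface and types are hypothetical stand-ins (mirroring the sketch shown earlier in this discussion), and the CDI device name in the comment is a made-up example:

```go
// Hypothetical consumer of the extended PodResources API. The client
// interface and types below are illustrative stand-ins, not the real
// generated gRPC client.
package sketch

type CDIDevice struct{ Name string }

type DynamicResource struct {
	ClassName      string
	ClaimName      string
	ClaimNamespace string
	CDIDevices     []CDIDevice
}

type ContainerResources struct {
	Name             string
	DynamicResources []DynamicResource
}

type PodResources struct {
	Name       string
	Namespace  string
	Containers []ContainerResources
}

// Client stands in for a gRPC client connected to the kubelet's
// pod-resources socket.
type Client interface {
	Get(podName, podNamespace string) (PodResources, error)
}

// cdiDeviceIDs collects the fully qualified CDI device names that a CNI
// meta-plugin would pass on as CNI arguments for the given pod
// (e.g. "vendor.example.com/net=rdma0").
func cdiDeviceIDs(c Client, podName, podNamespace string) ([]string, error) {
	pod, err := c.Get(podName, podNamespace)
	if err != nil {
		return nil, err
	}
	var ids []string
	for _, ctr := range pod.Containers {
		for _, dr := range ctr.DynamicResources {
			for _, dev := range dr.CDIDevices {
				ids = append(ids, dev.Name)
			}
		}
	}
	return ids, nil
}
```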
This approach was not discussed in the SIG Network and WG Multinetwork forums AFAIK; it would be better to involve all the stakeholders in these discussions to get a uniform solution.
The recent developments and consensus for the networking extension with DRA are described in #4861
#### Beta

- [ ] Gather feedback from consumers of the DRA feature and the k8snetworkplumbingwg working group
and SIG Network
Extend pod resource API for Dynamic Resource Allocation
DRA: Extend PodResources to include resources from Dynamic Resource Allocation #3695